This notebook presents the results of the data analysis part of a Master's thesis conducted by Hugo Moreau (hugo.moreau@epfl.ch) within the UPC Cablecom Data Management team. The project aimed at building a proof of concept for the use of machine learning to predict customer premises equipment (CPE) failure.
It aims at providing a thorough understanding of the different choices that had to be made during this phase, and will therefore contain some code repetition for the sake of explaining our line of thought correctly.
# Some important imports
%matplotlib inline
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from mpl_toolkits.mplot3d import Axes3D
sns.set_context('notebook')
# Sklearn imports
import sklearn
from sklearn import calibration
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
# Some imports that may require package installation
try:
    import missingno as msno
except ModuleNotFoundError:
    print('You need to run: pip install missingno')
# Own Scripts import
from scripts.energy_test_DP import *
from scripts.utils import *
from scripts.preprocessing import *
from scripts.plot import *
from scripts.model_selection import *
# get rid of warning due to deprecated modules in sklearn
import warnings
warnings.simplefilter('ignore')
# Constants
DATA_FOLDER = './Data'
Before Data Collection was completed, we decided it would be interesting to look at whether the fact that Day_0 is a weekend day or not has any influence on the vector of measurements. This would potentially allow us to exclude such Day_0 from the following analysis.
Because data collection was not finished at such an early stage, we decided to conduct the analysis on a subsample of the nodes. To do this we arbitrarily selected two service groups over a week ('ch-mbcALT301','ch-mbcARA301').
DATA_PATH = DATA_FOLDER+'/initial_non_aggregated.xlsx'
df = excel_to_df(DATA_PATH,DATA_FOLDER+'/extracted.pk')
df.head()
There exist three types of information in the dataset:
- static_infos: information concerning the CPE vector itself (the MAC address, account number, CMTS and service group) as well as information regarding the dates over which the vector was computed.
- measurements_non_miss: the real indicators that are polled using SAA, either from the CPE itself or from the CMTS.
- measurements_miss: because we aggregate at the database level, we wanted to keep the information regarding missing measurements, as it could be useful to the classifier at a later stage. For each aggregate from measurements_non_miss we therefore added the proportion of values in the aggregate that were missing.

cols = list(df.columns)
static_infos = ['MAC','ACCOUNT_NUMBER','CMTS','SERVICE_GROUP','FIRST_DAY','LAST_DAY','DAY_TYPE']
measurements = [x for x in cols if x not in static_infos]
measurements_non_miss = [x for x in measurements if not('MISS' in x)]
print('We have collected {:d} measurements'.format(len(measurements)))
As an initial naive approach we considered a KS-test on each individual measurement. This test is not so meaningful as it doesn't consider our vector as a multi-variate random variable but takes each dimension separately. Nevertheless as an initial approach it could help us explore each measurement.
The meaning of this test:
"If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same."
weekday_df = df[df.DAY_TYPE == 'WEEKDAY']
weekend_df = df[df.DAY_TYPE == 'WEEKEND']
values_weekday = weekday_df['OFFLINE_PCT_6H'].values
values_weekend = weekend_df['OFFLINE_PCT_6H'].values
from scipy import stats  # ks_2samp comes from scipy.stats
stats.ks_2samp(values_weekday, values_weekend)
In this example, where we try to decide whether the distribution of OFFLINE_PCT_6H differs between weekends and weekdays, we see that the p-value is almost maximal and we cannot reject the null hypothesis at any reasonable confidence level.
We will compute this statistic for all the different variables and explore the ones with the smallest p-values, to see if the human eye can distinguish the two distributions.
test_results = get_ks_test_result(weekday_df,weekend_df,measurements)
test_results.head(10)
sorted_mes = list(test_results.index)
Now we can see that some p-values are fairly low (for example, for MISS_SNR_UP_18H we can reject the null hypothesis at a significance level below $0.01\%$; in other words, under the test's assumptions the two distributions differ with $99.99\%$ confidence). Therefore we wanted to observe graphically how this translates in terms of distributions.
plot_difference(sorted_mes[0],weekday_df,weekend_df)
plot_difference(sorted_mes[1],weekday_df,weekend_df)
plot_difference(sorted_mes[2],weekday_df,weekend_df)
Unfortunately, given the number of days observed for weekends versus weekdays, we cannot be sure that this result can actually be trusted: it may simply come from the fact that we have unbalanced classes. Moreover, because the MISS_* measurements are simply ratios of missing measurements over 6h windows, they are not continuous variables, which makes it hard to draw conclusions from the KS test (the documentation states that the test assumes continuous distributions, so it has little value here).
By observing the distributions we see a pattern in all 3 cases: for weekdays most of the samples are 0, while about $40\%$ of them are $16.7$ for weekends. This translates to having just one missing value over a 6h window, which isn't necessarily a noticeable difference.
Therefore we decided to look at measurements that are real measurements, and not just missing ratios.
test_results_no_miss = get_ks_test_result(weekday_df,weekend_df,measurements,with_miss_mes=False)
sorted_mes_no_miss = list(test_results_no_miss.index)
test_results_no_miss.head(3)
plot_difference(sorted_mes_no_miss[0],weekday_df,weekend_df)
plot_difference(sorted_mes_no_miss[1],weekday_df,weekend_df)
plot_difference(sorted_mes_no_miss[2],weekday_df,weekend_df,n_bins=10000)
None of these plots displays a significant difference between the two datasets, especially the very last one. This is why we will look at the cumulative distribution.
f, axes = plt.subplots(1, 2, figsize=(18, 5), sharex=True , sharey =True)
axes[0].set_xlim(-0.005,0.010)
axes[1].set_xlim(-0.005,0.010)
weekday_df['CER_DN_18H'].dropna().hist(cumulative=True, density=True, bins=10000, histtype='step', ax=axes[0])
weekend_df['CER_DN_18H'].dropna().hist(cumulative=True, density=True, bins=10000, histtype='step', color='red', ax=axes[1])
f.show()
But again the difference isn't very clear here, and we cannot necessarily conclude that these come from different distributions, as the apparent difference could arise from the lack of data.
The problem with the previous test is that it considers each dimension independently, whereas we are interested in determining whether the day_type influences our vector as a whole. Some links of interest:
The third link seems to be a good approach for our use case. Nevertheless it will require some numpy tricks, as it can be quite computationally expensive if not done properly. We therefore use some dynamic programming (in order not to recompute the Euclidean distance between each pair of vectors multiple times) and numpy fancy indexing, in order to leverage already-implemented optimizations.
All the functions used to perform what we will refer as the energy test are implemented in the energy_test_DP.py file.
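Since energy_test_DP.py is not reproduced here, a minimal sketch of the idea may help (the function names and the scipy-based distance computation are ours, not the project's implementation, and samples are rows rather than columns): the pooled pairwise-distance matrix is computed once and then re-indexed for every permutation replicate, which is the dynamic-programming trick mentioned above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_statistic(D, idx_x, idx_y):
    """Energy statistic computed from a precomputed pairwise distance matrix D."""
    n, m = len(idx_x), len(idx_y)
    d_xy = D[np.ix_(idx_x, idx_y)].mean()
    d_xx = D[np.ix_(idx_x, idx_x)].mean()
    d_yy = D[np.ix_(idx_y, idx_y)].mean()
    return (n * m / (n + m)) * (2 * d_xy - d_xx - d_yy)

def energy_test(x, y, n_replicates=199, seed=0):
    """Permutation two-sample energy test; x and y are (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    D = cdist(pooled, pooled)  # computed once, re-indexed for every replicate
    n = len(x)
    observed = energy_statistic(D, np.arange(n), np.arange(n, len(pooled)))
    exceed = 0
    for _ in range(n_replicates):
        perm = rng.permutation(len(pooled))
        if energy_statistic(D, perm[:n], perm[n:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_replicates + 1)
```

A small p-value then indicates that the two multivariate samples are unlikely to come from the same distribution.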
Now we briefly test that the method does indeed distinguish different and identical sampling distributions.
DISTANCE_MATRIX_FOLDER = 'Distance_matrices/'
# Here it shouldn't reject the null hypothesis as the two are sampled from the same distribution
path = DISTANCE_MATRIX_FOLDER + 'same_distrib'
x = np.stack([np.random.exponential(size=200),np.random.normal(size=200)])
y = np.stack([np.random.exponential(size=200),np.random.normal(size=200)])
E_distance_replicates = energy_two_sample_test(x,y,499,1,distance_matrix_bkp=path)
# While here it should reject for obvious reasons
path = DISTANCE_MATRIX_FOLDER + 'different_distrib'
x = np.stack([np.random.exponential(size=200),np.random.normal(size=200)])
y = np.stack([np.random.normal(size=200),np.random.normal(size=200)])
E_distance_replicates = energy_two_sample_test(x,y,499,1,distance_matrix_bkp=path)
We can see that naive testing is conclusive, which is encouraging regarding the successful implementation of the test discussed in the paper.
# Averaging by day_type to obtain a contained dataset size
The size of the distance matrix on the full dataset makes it very long to run and hard to store, which is why we first compute two datasets of equal size (we average a given CPE vector over weekdays and over weekend days).
weekday_df_avg = weekday_df.groupby(by=['MAC','ACCOUNT_NUMBER','CMTS','SERVICE_GROUP','FIRST_DAY','LAST_DAY']).mean()
weekend_df_avg = weekend_df.groupby(by=['MAC','ACCOUNT_NUMBER','CMTS','SERVICE_GROUP','FIRST_DAY','LAST_DAY']).mean()
path = DISTANCE_MATRIX_FOLDER + 'day_type_avg'
x = weekday_df_avg[measurements].dropna().values.T
y = weekend_df_avg[measurements].dropna().values.T
E_distance_replicates = energy_two_sample_test(x,y,99,1,distance_matrix_bkp=path)
#used to run in 504s when computing the distance matrix
The test rejects the null hypothesis, and we can see that the bootstrapped energy estimates are on a much lower scale than the observed one, further supporting this result.
E_distance_replicates.sort()
min_ = E_distance_replicates[0]
max_ = E_distance_replicates[-1]
print('Replicates range in: [{},{}]'.format(round(min_,3),round(max_,3)))
We could challenge this result given the very high dimension of the vectors, which could cause the Euclidean distance to overshoot.
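This concern can be illustrated with a small self-contained experiment (purely illustrative, not part of the analysis): as the dimension grows, Euclidean distances between random points concentrate, so the relative contrast between the nearest and farthest point shrinks.

```python
import numpy as np

def relative_contrast(dim, n=500, seed=0):
    """Spread of Euclidean distances from one random point to the others."""
    rng = np.random.default_rng(seed)
    pts = rng.normal(size=(n, dim))
    d = np.linalg.norm(pts[0] - pts[1:], axis=1)
    return (d.max() - d.min()) / d.min()

# distances concentrate as the dimension grows: the ratio shrinks
for dim in (2, 20, 200, 2000):
    print(dim, round(relative_contrast(dim), 3))
```

In very high dimension all pairwise distances look alike, which is why distance-based test statistics deserve some caution here.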
Running on the full dataset was not possible, so rather than taking all the samples available to us, we will draw random samples of weekdays and weekends from the full dataset to perform the test efficiently.
The dataset has:
Because we sample we cannot store the distance matrix and need to recompute it each time.
x = weekday_df[measurements].dropna().values.T
y = weekend_df[measurements].dropna().values.T
# and we sample it without replacement
sample_size = 10000
x_samp = x[:,np.random.choice(x.shape[1],sample_size,replace=False)]
y_samp = y[:,np.random.choice(y.shape[1],sample_size,replace=False)]
E_distance_replicates = energy_two_sample_test(x_samp,y_samp,99,1,use_bkp=False)
As a control we would like to run the same test on populations that we suppose come from the same distribution.
week_sample = x[:,np.random.choice(x.shape[1],2*sample_size,replace=False)]
x_samp = week_sample[:,:sample_size]
y_samp = week_sample[:,sample_size:]
E_distance_replicates = energy_two_sample_test(x_samp,y_samp,99,1,use_bkp=False)
sample_size = 5000
weekend_sample = y[:,np.random.choice(y.shape[1],2*sample_size,replace=False)]
x_samp = weekend_sample[:,:sample_size]
y_samp = weekend_sample[:,sample_size:]
E_distance_replicates = energy_two_sample_test(x_samp,y_samp,99,1,use_bkp=False)
We will now use a more robust procedure where we perform multiple tests on the full dataset using multiple samples, obtaining a more meaningful p-value that averages away potential variations due to the sampling.
x = weekday_df[measurements].dropna().values.T
y = weekend_df[measurements].dropna().values.T
energy_two_sample_large_dataset(x,y)
We still reject the null hypothesis. And we keep our control experiments, where we run the test on two samples from the same population and observe that the null hypothesis cannot be rejected.
energy_two_sample_large_dataset(x,None,sample_size=5000,similar=True)
energy_two_sample_large_dataset(y,None,sample_size=5000,similar=True)
The test seems to indicate that the position of Day_0 within the week does indeed influence the vectors of aggregates over the last 24h (the measurements regarding Day_0). We will nevertheless consider all days as equally eligible to be selected as Day_0, and will decide at a later stage whether to keep them or not: once we are classifying, this choice will yield results that are more easily interpretable, because we will get meaningful insights into how discarding those days improves the scores.
Thanks to a framework provided by Business Intelligence, we didn't have to perform the usual plotting of all 291 features, as a report (linked to the thesis) can be generated automatically.
# We set some pandas options to improve readability when a dataframe is pretty printed.
pd.set_option('max_columns', 290)
pd.set_option('max_colwidth', 5000)
pd.set_option('display.date_dayfirst', True)
Because the project is carried out while data keeps being collected, we developed some tools that allow us to easily import the data we extract from our backup table.
As can be observed in the signature of import_sample, the helper function will read the Excel file containing the sample, convert each feature to the correct format, convert the hardware models to indices, and keep the vectors from dates where there exists at least one sick CPE; most importantly, it will serialize the result to allow for faster future imports.
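The caching pattern behind import_sample can be sketched as follows (a simplified stand-in, not the project's implementation; the real helper also fixes dtypes, maps hardware models to indices and filters dates):

```python
import os
import pandas as pd

def import_sample_sketch(excel_path, backup_path):
    """Read the Excel extract, or the pickled copy if one already exists."""
    if os.path.exists(backup_path):
        return pd.read_pickle(backup_path)   # fast path: deserialize the cached copy
    df = pd.read_excel(excel_path)           # slow path: parse the Excel file
    df.to_pickle(backup_path)                # serialize for faster future imports
    return df
```

Parsing a large Excel file is slow, so the pickle cache is what makes repeated notebook runs tolerable.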
DATA_PATH = DATA_FOLDER + '/sample_27_04.xlsx'
# We use the help function in order to obtain the dataframe in a correct format.
df = import_sample(DATA_PATH,DATA_FOLDER+'/sample_27_04.pk')
df.head()
# We also identify the columns that are going to be used in later stages.
non_feature_cols = ['mac','day_0',
'cly_account_number',
'saa_account_number',
'cmts','service_group',
'seq_id','milestone_name']
feature_cols = [x for x in list(df.columns) if x not in non_feature_cols]
x = df[feature_cols]
null_counts = 100*x.isnull().sum()/len(x)
print("Percentage of null values in each column:\n{}".format(null_counts.sort_values(ascending=False)[:6]))
We want to look at missing values. The missingno module allows us to visualize the missing values as white lines on a black matrix. Nevertheless, such a package is complicated to use with this dataset, as we have a lot of features, making it hard to see anything.
msno.matrix(x)
In this matrix we can see the missing values in white, and we can see that some features are NULL together, which may be an interesting pattern. Nevertheless, considering the number of columns in our DataFrame, it is fairly hard to study such a pattern.
For large datasets it might be hard to investigate trends in missing values, therefore we use a dendrogram, which the missingno package is said to:
allow[...] you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heat-map.
and also regarding its interpretation:
To interpret this graph, read it from a top-down perspective. Cluster leaves which linked together at a distance of zero fully predict one another's presence—one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on.
fig = msno.dendrogram(x)
What is reassuring about this dendrogram is that most of the leaves group together measurements that are aggregated over the same time period (e.g. all aggregates over $day_{-4}$ are grouped together: *_4d). This shows that, as expected, when a router is offline most of the corresponding features on that particular day are missing (we say most because some measurements are also polled at the CMTS level and concern the CMTS, not only the CPE).
We would also like to look at features that are highly correlated: given the very high dimensionality of the dataset, this could potentially allow us to reduce the dimension of the vectors if we find features that are sufficiently correlated.
To perform such analysis we can use the most recent sample that is available.
s_day = '11_05'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'
extracted = import_sample(DATA_PATH,BACKUP_PATH)
dates = extracted['day_0'].values
# Keep only the features that are used for prediction (delete the MAC, Service Group etc.)
x_extracted, y_extracted = get_ml_data(extracted)
# Then we perform some initial transformation on the data that will be explained at a later time
x_encoded = encode_categorical(x_extracted)
x_no_missing = impute_missing(x_encoded,method='mean')
An initial analysis highlighted that some columns were exactly identical: during data collection an error was made and, for many measurements, a value was saved under two different columns. In the dataset this translates into columns that are exactly identical.
dropped_features = []
correlated_lists,identical_lists = find_correlation(x_no_missing)
identical_lists
The error that seems to have happened is that we added a miss_*_1d feature for which we already have a feature called miss_*, for all 9 CPE measurements. Therefore deleting them is non-destructive and can be done without much thought.
The choice of which one to drop is fairly arbitrary: as the only criterion, we will keep the feature with the shortest name.
to_drop = []
for k in identical_lists:
    v = list(identical_lists[k])[0]
    drop = v if (len(k) < len(v) and k not in to_drop) else k
    to_drop.append(drop)
dropped_features = dropped_features + to_drop
x_no_missing = x_no_missing.drop(labels=to_drop,axis=1)
Now we should have deleted all identical features from the dataset, and we can run the analysis once more to obtain the highly correlated features among the remaining ones.
correlated_lists,identical_lists = find_correlation(x_no_missing)
print('We have {} identical feature-pairs'.format(len(identical_lists)))
Now that all identical features have been deleted we can look into highly correlated features.
print('We have {} groups of correlated features'.format(len(correlated_lists)))
correlated_lists
Looking at this list, we can see that there is a pattern we can investigate first. The way we built the features, we have taken the same measurement aggregated over different time periods: therefore, if two measurements are originally highly correlated, each time-aggregated version should be as well.
From the list of correlated features it seems that those underlying variables are identical. One can think of them as cliques of variables:

- cmts_ms_utilization_up <---> cmts_utilization_up
- miss_pct_traffic_sdmh_up <---> miss_pct_traffic_dmh_up
- miss_snr_dn <---> miss_rx_dn <---> miss_tx_up
- miss_snr_up <---> miss_rx_up

We will first work on those "cliques" before going further. Indeed, these are the correlations that make the most sense, as it would be strange for two measurements X and Y to be correlated originally while their aggregates over a day are not.
suffixes_non_miss = ['','_6h','_12h','_18h','_1d','_2d','_3d','_4d','_5d']
suffixes_miss = ['','_6h','_12h','_18h','_24h','_2d','_3d','_4d','_5d']
The easiest way to perform such an exploration would be to create a function that produces multiple pair plots, plotting each key of correlated_lists against each element of the corresponding value-set. This would indeed be very useful for the MISS_ features, as they take a finite set of values, but also for the others, as a scatter plot doesn't inform on the concentration of data points (where an hexbin plot would have).
Nevertheless, we just want to perform some visual exploration and do not particularly need the population density; rather, we want to see how the joint data points lie with respect to the identity line.
compare_candidate_identical('cmts_ms_utilization_up' ,['cmts_utilization_up'],suffixes_non_miss,x_no_missing)
The visual inspection confirms our hypothesis that cmts_ms_utilization_up <---> cmts_utilization_up
compare_candidate_identical('miss_pct_traffic_sdmh_up',['miss_pct_traffic_dmh_up'],suffixes_miss,x_no_missing)
Again a very simple inspection confirms the hypothesis that miss_pct_traffic_sdmh_up <---> miss_pct_traffic_dmh_up
For the third group we will look at:
- how miss_snr_dn relates to each of miss_rx_dn and miss_tx_up
- how miss_rx_dn and miss_tx_up relate

compare_candidate_identical('miss_snr_dn',['miss_rx_dn','miss_tx_up'],suffixes_miss,x_no_missing)
compare_candidate_identical('miss_rx_dn',['miss_tx_up'],suffixes_miss,x_no_missing)
The Pearson correlation always being higher than $0.98$, we are confident that the three variables indeed form a clique in terms of correlation.
compare_candidate_identical('miss_snr_up',['miss_rx_up'],suffixes_miss,x_no_missing)
We have shown that we are indeed in the presence of cliques in all 4 cases (where an edge is set when two variables are highly correlated), and we therefore need to keep only one element per clique.
This choice of the feature that we wish to keep, will again be very arbitrary and we do not justify it.
to_drop = ['cmts_ms_utilization_up' + s for s in suffixes_non_miss] + \
['miss_pct_traffic_sdmh_up' + s for s in suffixes_miss] + \
['miss_rx_dn' + s for s in suffixes_miss] + \
['miss_tx_up' + s for s in suffixes_miss] + \
['miss_snr_up' + s for s in suffixes_miss]
dropped_features = dropped_features + to_drop
x_no_missing = x_no_missing.drop(labels=to_drop,axis=1)
x_no_missing.shape
Now that we have taken out all the cases that unveiled an underlying pattern, we can look at particular cases. To do that we run the test again on the remaining DataFrame.
correlated_lists,identical_lists = find_correlation(x_no_missing)
print('We have {} identical feature-pairs'.format(len(identical_lists)))
print('We have {} groups of correlated features'.format(len(correlated_lists)))
correlated_lists
Looking at the list of correlated variables, we can see that one of the main problems is that this time the time dimensions of the two features do not all match. This means the correlation might hold only on this particular dataset, but perhaps not on future values.
We will look at these cases individually. (Two plots are generated: one with bins colored according to the log cardinality of points, and the other with standard bins, in order to have a better idea of the density of clusters when the number of points is high.)
sns.jointplot("miss_rx_up_24h", "offline_pct_24h", data=x_no_missing, kind="hex", joint_kws = {'bins':'log'})
sns.jointplot("miss_rx_up_24h", "offline_pct_24h", data=x_no_missing)
sns.jointplot("miss_rx_up_6h", "offline_pct_6h", data=x_no_missing, kind="hex", joint_kws = {'bins':'log'})
sns.jointplot("miss_rx_up_6h", "offline_pct_6h", data=x_no_missing)
sns.jointplot("miss_snr_dn", "miss_rx_up", data=x_no_missing, kind="hex", joint_kws = {'bins':'log'})
sns.jointplot("miss_snr_dn", "miss_rx_up", data=x_no_missing)
Looking at the plots, we can see that the high correlation of those pairs of features resides in the fact that most of the samples are located near the $(0,0)$ point; we therefore decide not to consider them as highly correlated (if they really were, we should also see correlation for other time periods, as we saw for the previously deleted features).
So we will keep a list of features that we wish to leave out of our analysis for later.
print('By analysing correlated features we have succeeded to get rid of {} features ({:.3f}% reduction).'.format(
len(dropped_features), 100*len(dropped_features)/x_encoded.shape[1]))
It is also interesting to look at columns whose variance is low. Indeed, we wish to investigate predictors that have both of the following characteristics:
columns_to_investigate = nearZeroVar(x_no_missing)
columns_to_investigate
As a first observation, we can say that all these columns are discrete (since they are mostly percentages out of 6 measurements). There are also the model_* features, which are one-hot encodings and for which we therefore expect a low variance.
We will focus on the features starting with pct_traffic_*: they are indeed percentages, but we are not sure in what domain they are expressed.
sns.distplot(x_no_missing['pct_traffic_dmh_up'], bins=20, kde=False, rug=True);
counts = x_no_missing['pct_traffic_dmh_up'].value_counts()
total = counts.sum()
(100*counts/total).head(4)
Looking at the distribution of the variable, we are quite skeptical that deleting this feature would make sense, so we decide to leave it for a later feature-selection round (once we have a classifier, we can use it to discard the features that are not very meaningful, by ranking them).
We haven't found any conclusive results regarding low-variance features.
Before the data can be used to perform any machine learning task, we need to modify its form a bit. We therefore describe our pipeline.
y = df['milestone_name']
x = df[feature_cols]
We will convert the milestone name because we want to interpret it as :
- NaN: the CPE is supposed to be healthy (0)
- otherwise: sick (1)

y_binary = y.isnull().map(lambda x: 0 if x else 1)
This has been implemented in preprocessing.convert_to_binary_labels().
As per the Wikipedia definition
one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)
In ML, one-hot encoding is often used in the presence of a categorical feature: the regressor should not use the amplitude of the feature, but rather the value it takes. For example, consider the feature that was constructed during the import of the data: weekday.
print('Different types :\t',set(x.dtypes.values))
print('Discrete features :\t',list(x.dtypes[x.dtypes == 'int64'].index))
x['weekday'].value_counts().sort_index()
It takes values ranging from 0 to 6, as there are only 7 days in a week. The classifier should not weight the raw value of this feature (which would mean that a high value of weekday is by itself an indicator of failure or not). This is why we construct new indicator features wk_0, …, wk_6 such that wk_i $= 1$ iff the weekday is $i$.
Hence the classifier can give a weight to the fact that the day is, say, Tuesday.
We do this for both the hardware_model and the weekday as they are the only two categorical features.
This has been implemented in preprocessing.encode_categorical
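For illustration, the same encoding can be obtained with pandas (the wk prefix is ours; encode_categorical may differ in details):

```python
import pandas as pd

df = pd.DataFrame({'weekday': [0, 2, 6, 2]})
# one indicator column per observed category; each row has exactly one 1
encoded = pd.get_dummies(df['weekday'], prefix='wk')
print(encoded.columns.tolist())  # ['wk_0', 'wk_2', 'wk_6']
```

Note that get_dummies only creates columns for the categories present in the sample, which is why fitting the encoding on the full dataset (or fixing the category list) matters in practice.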
We can see that NULL values are quite present in our dataset and we need a strategy to handle them. Many imputation techniques exist; they are often used to hide missing values (replacing them by the mean of the feature, for example, or by 0). Here we wish our missing values to be meaningful, as they could really indicate a problem with a particular CPE; this scenario is often referred to as Missing Not At Random (MNAR).
Nevertheless, we have already added to each feature a corresponding MISS_ feature that indicates the percentage of samples that were missing when computing the feature (because those features are usually aggregates of multiple values).
Therefore we wish to hide the irregularities of such missing values as much as possible, and will use mean replacement.
Because we replace by the mean, we always need to use the mean of the training set in order to impute values in both the training and testing sets. scikit-learn offers a neat tool to perform this operation every time we fit and predict (a pipeline that chains all the steps necessary to transform the data, as well as learners).
# the imputer module already provides a way to easily replace missing values by the mean of a particular column.
mean_null_imputer = sklearn.preprocessing.Imputer(strategy='mean')
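Note that in recent scikit-learn versions Imputer has been replaced by SimpleImputer; the fit-on-train / transform-on-test pattern described above looks like this (a sketch with toy data):

```python
import numpy as np
from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer in recent sklearn

x_train = np.array([[1.0, np.nan], [3.0, 4.0]])
x_test = np.array([[np.nan, 0.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(x_train)                    # statistics come from the training set only
x_test_imp = imputer.transform(x_test)  # test NaNs filled with *training* means
print(x_test_imp)  # [[2. 0.]]
```

The column mean (here 2.0 for the first column) is learned once on the training data, so the test set never leaks information into the imputation.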
For now the scaling of the data will be fairly arbitrary. We may later compare different scalings on a given model, as scaling can alter performance (some classifiers, like gradient boosting, do not necessarily perform better with scaled features, while others, like a linear regression, would perform poorly without scaling).
We have three common choices for normalizing the data:
# Again sklearn has this transformer already built in and can be used with pipelines
standardizer = sklearn.preprocessing.StandardScaler()
The high dimensionality of the dataset has already been mentioned above. This is why we would like to keep the features that encapsulate the most information, and we will look at the different tools available to perform such an analysis.
If we look at the definition from our favorite collaborative encyclopedia:
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
Therefore it will compute a new basis in which to represent the features, such that the new 'transformed' features are linearly uncorrelated.
There then exists a trade-off between the dimensionality of this new basis (the number of components kept in the resulting space) and the explained variance of the dataset that remains in this new representation (compression results in information loss, as one could expect).
Dimensionality reduction usually happens as the last preprocessing step, so we first need to pre-process the data. We will import the data using the preprocessing functions we designed to handle all the steps discussed above.
s_day = '27_04'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'
# We use the help function in order to obtain the dataframe in a correct format.
extracted = import_sample(DATA_PATH,BACKUP_PATH)
# Extract the raw data
x_extracted, y_extracted = get_ml_data(extracted)
# Perform binarization of labels and encoding of categorical values
y = convert_to_binary_labels(y_extracted)
x_encoded = encode_categorical(x_extracted)
# remove correlated features
x_df = remove_features(x_encoded,verbose = True)
x = x_df.values
# preprocessing pipeline construction, impute null values and scale inputs
preproc = make_pipeline(
sklearn.preprocessing.Imputer(strategy='mean'),
sklearn.preprocessing.StandardScaler())
X = preproc.fit_transform(x)
# We want to keep the features that allow us to explain 95% of the variance.
pca = PCA()
pca.fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
The curve isn't very steep, which isn't good news for us: it forces us to use a lot of components in order to capture a good part of the variance.
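As a side note, scikit-learn can select the number of components for a target explained variance directly, by passing a float to PCA (illustrated on synthetic correlated data, not our dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic data with correlated features, so a few components dominate
data = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

pca_95 = PCA(n_components=0.95)  # keep the fewest components reaching 95% variance
reduced = pca_95.fit_transform(data)
print(reduced.shape[1], pca_95.explained_variance_ratio_.sum())
```

This avoids reading the cut-off point off the cumulative-variance curve by hand.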
Thanks to such tool we can look at our dataset in reduced dimensional spaces: 2D and 3D. It would be interesting to see if we can distinguish between Healthy and Sick CPEs graphically in such space:
x_2D = PCA(2).fit_transform(X)
x_2D_healthy = x_2D[y == 0]
x_2D_sick = x_2D[y == 1]
fig, axes = plt.subplots(nrows=1, ncols=2, sharex = True, sharey = True,figsize=(20, 5))
axes[0].hexbin(x_2D_healthy[:,0],x_2D_healthy[:,1],bins='log',mincnt=1)
axes[0].set_title('Healthy CPE vectors')
axes[1].hexbin(x_2D_sick[:,0],x_2D_sick[:,1],bins='log',mincnt=1)
axes[1].set_title('Sick CPE vectors')
fig.suptitle('2D representation of CPE vectors (PCA red.)')
plt.show()
Visually, the difference between the two populations isn't evident.
x_3D = PCA(3).fit_transform(X)
threedee = plt.figure().add_subplot(projection='3d')
threedee.scatter(x_3D[:,0], x_3D[:,1], x_3D[:,2])
The 3D representation also doesn't seem to show much information.
Another technique that can be used to perform dimensionality reduction would be to use LDA.
It is a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events
With such a technique the use of standardization isn't necessary. Also, because it attempts to separate the two classes, it is a supervised method (it uses the labels to perform the transformation).
preproc_no_scale = make_pipeline(
sklearn.preprocessing.Imputer(strategy='mean'))
X_prime = preproc_no_scale.fit_transform(x)
LDA = sklearn.discriminant_analysis.LinearDiscriminantAnalysis()
x_prime_1D = LDA.fit_transform(X_prime,y) # fit on the imputed (but unscaled) features
x_prime_1D.shape
plotting_df = pd.DataFrame(x_prime_1D,columns=['LDA comp'])
plotting_df['label'] = pd.Series(y).map(lambda x : 'sick' if x == 1 else 'healthy')
sns.boxplot(x= 'label',y="LDA comp", data=plotting_df);
Again, it is not yet known whether such dimensionality reduction will help the downstream classification. We will therefore try both reductions and see which one performs best for each model.
Data collection left us with substantial doubt about the way we label the data, for many reasons. Because these reasons are explained in depth in the thesis, we will focus here on high-level hints of why our labeling may be wrong:
Therefore our initial idea was to cluster the data points and use the knowledge of an HFC support expert to determine the labels of each cluster. This is why we started clustering the data.
Before presenting the data to our expert, it made sense to ensure that our clustering was indeed discovering an underlying structure in the data; it would be useless to sample from clusters that do not group the data properly.
In order to determine the quality of the clustering (using K-means) for a given number of clusters, we will use a silhouette analysis, as suggested by the scikit-learn documentation.
Interpreting these plots relies on two characteristics:
A commonly accepted convention is to read the average silhouette score on the following scale:
# no dimensionality reduction
scripts.plot.silhouette_analysis(list(range(2,8)),X)
Interpretation of the coefficient values: the best value is 1 and the worst is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
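As a concrete illustration of these interpretation rules (on synthetic blobs, not on our CPE data), the average silhouette score of a K-means clustering can be computed directly with `sklearn.metrics.silhouette_score`:

```python
# Synthetic illustration: two well-separated Gaussian blobs should yield
# an average silhouette score close to 1 ("strong structure").
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(42)
toy = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 10])
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(toy)
score = silhouette_score(toy, labels)
print('average silhouette score: {:.2f}'.format(score))
```

On well-separated blobs the score lands near 1; on our real feature vectors it is much lower, which is exactly what the silhouette plots quantify.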
explained_variance = 0.95
pca = PCA(explained_variance)
pca.fit(X)
dimensions = len(pca.components_)
print("We retained {} dimension(s) to explain {}% of the variance".format(dimensions,100*explained_variance))
# applying a PCA retaining 95% of variance
scripts.plot.silhouette_analysis(list(range(2,8)),pca.fit_transform(X))
The average silhouette scores are not very encouraging. Nevertheless, we cannot rule out the curse of dimensionality: we can see that decreasing the number of components increases the average silhouette scores.
We will also try to see whether we can apply the elbow method to find the correct number of clusters (on a much wider range of clusters).
plot_silhouette_score(list(range(2,40)),pca.fit_transform(X))
From these plots we can observe:
One of the assumptions of K-means is that clusters have approximately the same size; we therefore subsample the 'believed healthy' population.
X_s,y_s = get_balanced_classes(X,y)
scripts.plot.silhouette_analysis(list(range(2,8)),X_s)
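`get_balanced_classes` comes from our own scripts; as a hedged sketch (the real helper may differ in detail), balancing by randomly undersampling the majority class could look like this:

```python
# Hypothetical sketch of what get_balanced_classes could do: randomly
# undersample the majority class down to the minority class size.
import numpy as np

def balance_classes(X, y, random_state=42):
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # keep n_min randomly chosen indices per class, then shuffle them
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes])
    rng.shuffle(keep)
    return X[keep], y[keep]

# 8 'healthy' (0) vs 2 'sick' (1) samples -> 2 of each after balancing
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = balance_classes(X_demo, y_demo)
```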
K_means = sklearn.cluster.MiniBatchKMeans(n_clusters=2,batch_size=1000)
labels = K_means.fit_predict(X_s)
analyse_clustering(y_s,labels)
What makes us even more confident that our clustering is not meaningful is that running it multiple times doesn't yield the same results; the clustering can therefore be considered somewhat random.
We will compare different levels of retained variance for different clusters with balanced classes. We will be tracking the silhouette score as it seems to give us a fairly good estimate of the quality of the cluster.
plot_silhouette_score_different_PCA(X,[0.3,0.5,0.7,0.9],list(range(2,15)),True, y)
LDA could presumably enable better clustering, as it computes a linear combination of the features that separates the two classes as much as possible.
LDA = sklearn.discriminant_analysis.LinearDiscriminantAnalysis()
K_means = sklearn.cluster.MiniBatchKMeans(n_clusters=2,batch_size=600)
labels = K_means.fit_predict(LDA.fit_transform(X_s,y_s))
analyse_clustering(y_s,labels)
get_single_silhouette(X_s,labels,2)
While the precision and recall of this clustering are promising, we are puzzled by the alarmingly low silhouette score, which we cannot really explain. We concluded that LDA could definitely be a great tool for further classification. Nevertheless, a recall of only 50% is worrying, as we do not expect 50% of our 'sick' CPEs to actually be healthy.
Given our inability to cluster the data into a meaningful structure, we decided to move directly to the design of a classifier. This was done for multiple reasons, but mainly because changing the labels based on the clustering results would not be scientifically sound: we didn't have the possibility to check a significant number of MACs from each cluster to ensure the clustering was conclusive, and we would therefore have risked overestimating the performance of the classifier by introducing a bias that artificially inflated our results.
We also put the goal of the project for UPC back into perspective: to show evidence of an underlying structure in the data that would later give rise to a project where the collected data itself would be modified in order to build a proper prediction model.
Because we didn't know which models would fit our task best, we decided to begin with an extensive exploration of the most popular models for binary classification.
In order to compare different models we focused on the Receiver Operating Characteristic (ROC) curve of each classifier. This curve orders predictions by their score (how sure we are that they belong to the 'sick' group); then, for each prediction, it goes up if the prediction is correct and right if it isn't. The identity line ($x=y$) therefore corresponds to a random classifier.
We are interested in the classifier with the steepest curve near zero, as this means its top predictions are very accurate (we could then order our predictions and handle only the top $\sim 15\%$). This is in keeping with our goal of showing the existence of an underlying structure.
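To make this construction concrete, here is a toy example (hypothetical scores, not our data) of obtaining the ROC points and the area under the curve with `sklearn.metrics`:

```python
# Toy scores for the 'sick' class (hypothetical, not our CPE data)
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])
fpr, tpr, thresholds = roc_curve(y_true, scores)  # points of the ROC curve
auc = roc_auc_score(y_true, scores)               # area under that curve
```

The AUC equals the probability that a randomly chosen sick sample is scored above a randomly chosen healthy one (8 of the 9 sick/healthy pairs here).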
x,y,_ = usable_data('27_04',DATA_FOLDER)
Our initial assumption was that we could put all the vectors in one dataset, regardless of the day_0 at which they were collected, and then cross validate on this dataset. This assumption later turned out to be wrong, but for completeness we still present the results of this exploration.
In order to build comparable ROC curves and obtain unbiased estimates of classifier performance, we compute both on multiple folds and then derive the mean and confidence intervals to assess the stability of each model.
# we said we would try the two preprocessing steps with LDA or PCA
PCA_preproc = make_pipeline(
sklearn.preprocessing.Imputer(strategy='mean'),
sklearn.preprocessing.StandardScaler(),
PCA(0.95))
LDA_preproc = make_pipeline(sklearn.preprocessing.Imputer(
strategy='mean'), LinearDiscriminantAnalysis())
# We set the ratio of top predictions on which we wish to compute the metrics
TOP_RATIO = 0.2
results = []
create_result_dict = lambda name,prec,rec,F_one,p: {'Method':name, "Precision":prec,"Recall": rec,"F1-Score":F_one, "Parameter":p}
n_est_range = [10,100,1000,5000]
name,prec,rec,F_one,p = roc_curves(n_est_range,
lambda param: make_pipeline(
PCA_preproc,
RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1)),
x,y,
'Random Forests with PCA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves(n_est_range,
lambda param: make_pipeline(
LDA_preproc,
RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1)),
x,y,
'Random Forests with LDA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
pen_range = np.linspace(0.1, 4, 4)
name,prec,rec,F_one,p = roc_curves(pen_range,
lambda param: make_pipeline(PCA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1', dual=False, C=param, random_state=42, max_iter=1000))),
x, y,
'Linear SVM with PCA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
pen_range = np.logspace(-3, 0, 4)
name,prec,rec,F_one,p = roc_curves(pen_range,
lambda param: make_pipeline(LDA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1', dual=False, C=param, random_state=42, max_iter=1000))),
x, y,
'Linear SVM with LDA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
pen_range = np.logspace(-0.5, 0, 4)
name,prec,rec,F_one,p = roc_curves(pen_range,
lambda param: make_pipeline(
PCA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param))),
x, y,
'Non Linear SVM with PCA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves(pen_range,
lambda param: make_pipeline(
LDA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param))),
x, y,
'Non Linear SVM with LDA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves([10,100,500,1000],
lambda param: make_pipeline(
PCA_preproc,
sklearn.ensemble.GradientBoostingClassifier(n_estimators=param)),
x,y,
'Gradient Boosting with PCA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves([10,100,500,1000],
lambda param: make_pipeline(
LDA_preproc,
sklearn.ensemble.GradientBoostingClassifier(n_estimators=param)),
x,y,
'Gradient Boosting with LDA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves(np.linspace(1,100,20),
lambda param: make_pipeline(
PCA_preproc,
sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param))),
x,y,
'kNN with PCA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves(np.linspace(1,100,20),
lambda param: make_pipeline(
LDA_preproc,
sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param))),
x,y,
'kNN with LDA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves(np.logspace(-3,2,10),
lambda param: make_pipeline(
PCA_preproc,
sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1)),
x,y,
'Logistic Regression with PCA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves(np.logspace(-3,0,4),
lambda param: make_pipeline(
LDA_preproc,
sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1)),
x,y,
'Logistic Regression with LDA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves(np.logspace(-3,2,10),
lambda param: make_pipeline(
PCA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param))),
x,y,
'Ridge Regression with PCA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
name,prec,rec,F_one,p = roc_curves(np.logspace(-3,2,10),
lambda param: make_pipeline(
LDA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param))),
x,y,
'Ridge Regression + LDA',
TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
results_df = pd.DataFrame(results)
results_df.sort_values(by='F1-Score',ascending=False)
Looking at these results, we decided to focus on the three models that seemed most promising for our prediction task:
Before moving forward we were interested in the robustness of the model with respect to the number of days between training and testing. That would tell us how long we can use the model before retraining it.
For this test we took the most recent sample of CPEs available, in order to have more examples of sick CPEs and thus leverage as much as possible the data collected during this period.
DATA_FOLDER = './Data'
s_day = '16_05'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'
extracted = import_sample(DATA_PATH,BACKUP_PATH)
x_extracted, y_extracted = get_ml_data(extracted)
y = convert_to_binary_labels(y_extracted)
x_encoded = encode_categorical(x_extracted)
x = remove_features(x_encoded,verbose = True).values
In order to do that we will create a training dataset and a hold-out set. We will keep a few days of data in the hold-out set and look at the performance on it for different gaps between training and testing.
def get_performance_gap_hold_out(extracted, ho_n_days, gap_n_days, clf_pipeline, top_ratio=0.2):
    """
    Naively compute classification performance metrics on a hold-out set
    whose dates are gap_n_days away from the training dates.
    @params:
        - extracted: unprocessed dataset obtained from import_sample
        - ho_n_days: the number of days we wish to have in our hold-out set
        - gap_n_days: the number of days between the training dates and the testing dates
        - clf_pipeline: an instantiated classification pipeline
        - top_ratio: the ratio of top predictions on which we evaluate our model
    NB: the function assumes that the dates inside the sample are consecutive.
    """
    # Compute the dates we wish to work with
    dates = extracted['day_0'].unique()
    dates.sort()
    ho_days = dates[-(1 + ho_n_days):-1]
    dataset_days = dates[:len(dates) - (ho_n_days + gap_n_days + 1)]
    # Get the two datasets
    dataset = extracted[extracted['day_0'].isin(dataset_days)]
    hold_out = extracted[extracted['day_0'].isin(ho_days)]
    print('Hold Out population = {}\nTraining Population = {}'.format(len(hold_out), len(dataset)))
    # Prepare the training set and fit the pipeline on balanced classes
    x_extracted, y_extracted = get_ml_data(dataset)
    y = convert_to_binary_labels(y_extracted)
    x_encoded = encode_categorical(x_extracted)
    x = remove_features(x_encoded, verbose=False).values
    x_s, y_s = get_balanced_classes(x, y)
    clf_pipeline.fit(x_s, y_s)
    # Prepare the hold-out set the same way and evaluate on it
    x_extracted_ho, y_extracted_ho = get_ml_data(hold_out)
    y_ho = convert_to_binary_labels(y_extracted_ho)
    x_encoded_ho = encode_categorical(x_extracted_ho)
    x_ho = remove_features(x_encoded_ho, verbose=False).values
    x_s_ho, y_s_ho = get_balanced_classes(x_ho, y_ho)
    preds = clf_pipeline.predict_proba(x_s_ho)
    return get_metrics(preds, y_s_ho, top_ratio)
clf = make_pipeline(
PCA_preproc,
RandomForestClassifier(n_estimators = 5000, random_state = 42,n_jobs=-1))
P,R,F1 = get_performance_gap_hold_out(extracted,5,0,clf)
print('With a gap of {} day(s) between training and testing and {} days in the hold-out set:\nP = {:.2f}%\tR = {:.2f}%\tF1 = {:.2f}%'.format(0,5,P,R,F1))
This result was quite surprising. We realized that we may have been wrong to assume that we could cross validate on the dataset composed of all vectors regardless of their day_0. Even though this method doesn't perform a cross validation, we can see that the earlier performance estimates were clearly over-optimistic and that we need to change the way we evaluate our classifier.
Regarding the possible causes of this drop in performance, we could argue that the model may be overfitting particular dates; when such dates appear in both the training and testing sets, the measured performance becomes abnormally high.
Hence we will now use a K-fold scheme that keeps distinct days in each fold, so that training and testing always involve disjoint sets of dates. This ensures that we do not overestimate the performance of the classifier in the real use case: if predictions must be made on a daily basis, we will clearly not have access to the calls of the day we are trying to predict, so the vectors for that particular day_0 won't be in the training set.
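`distinct_date_split` is a custom helper; conceptually it is close to scikit-learn's `GroupKFold` with `day_0` as the group label, which guarantees that no date appears in both training and testing folds. A toy sketch (not the exact implementation, which additionally shuffles dates and balances classes):

```python
# Toy example: 8 samples over 4 distinct day_0 dates (all values hypothetical)
import numpy as np
from sklearn.model_selection import GroupKFold

X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1])
dates_demo = np.array(['d1', 'd1', 'd2', 'd2', 'd3', 'd3', 'd4', 'd4'])

splits = list(GroupKFold(n_splits=2).split(X_demo, y_demo, groups=dates_demo))
for train_idx, test_idx in splits:
    # no day_0 is shared between the training and testing fold
    assert set(dates_demo[train_idx]).isdisjoint(dates_demo[test_idx])
```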
Again we use the most recent sample that is available to us.
x,y,dates = usable_data('22_05',DATA_FOLDER)
distinct_date_split(x,y,dates,k=2)
We also saw that training on a balanced dataset improved the performance of the clustering algorithm, so we decided to balance our training sets as well. Moreover, to stay close to a regular cross validation, it is desirable to shuffle the data; we therefore shuffled the dates among the different folds.
To compare this new cross validation with the previous one, we added to all the evaluation methods used so far the option of using the new scheme.
# Old method
roc_curves([1000],
lambda param: make_pipeline(
PCA_preproc,
RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1)),
x,y,
'PCA + standard scaling + RF',
TOP_RATIO,dates=dates,distinct_date=False)
# New method
roc_curves([1000],
lambda param: make_pipeline(
PCA_preproc,
RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1)),
x,y,
'PCA + standard scaling + RF',
TOP_RATIO,dates=dates,distinct_date=True)
We can see that even with a cross validated measurement the drop in performance remains important. A precision below $50\%$ might seem useless (as we thought at first), but keep in mind that this is the precision on the sick class, not the overall accuracy of the classifier.
Let's put things in perspective:
# Let's look at the number of sick CPEs per day
df = pd.DataFrame()
df['date'] = dates
df['sick'] = y
df.groupby('date').sum().mean()
The previous analysis shows that we have on average 103 sick routers every day. One should keep in mind that we are working on a subsample of the dataset that contains all the sick routers but only a small fraction of the healthy ones. Our SQL data collection covers approximately 500k distinct routers every day.
Therefore, if an operator had to check a random router from the full network, he would have a probability of $\approx \frac{100}{500000} = 0.02\%$ of finding a sick CPE. If he instead checks routers from the subset we declare sick, that probability rises to $\approx 49\%$.
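The lift can be made explicit with a couple of lines (the figures are the rounded ones quoted above):

```python
# Figures assumed from the analysis above (rounded): ~100 sick CPEs per day
# out of ~500k routers collected daily.
daily_sick = 100
network_size = 500000
baseline = daily_sick / network_size   # chance of finding a sick CPE at random
model_precision = 0.49                 # precision among our top predictions
lift = model_precision / baseline
print('baseline = {:.2%}, lift = x{:.0f}'.format(baseline, lift))
```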
That is, in our view, a significant improvement, and we decided to keep exploring models with this technique to see what performance we could achieve on top predictions.
For now we simply consider our top $20\%$ of predictions, ranked by their probability of belonging to the sick class; we would instead like a probability cutoff that selects only the predictions we are most sure about.
We will therefore look at the evolution of precision and recall for different models, to capture the trade-off between the two: either we catch more sick CPEs but lose precision (the fraction of CPEs that we declare sick and that are indeed sick), or we catch fewer but are more sure about the ones we declare sick.
In order to evaluate our models we will look at the highest level of recall that can be achieved for a desired precision. In plain terms, this tells us the proportion of sick CPEs we can catch depending on how sure we wish to be of our predictions.
We have to use such evaluation as the true precision-recall curve cannot be approximated in an unbiased manner using a cross validation.
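The evaluation just described, the highest recall over all thresholds whose precision meets the target, can be sketched with `sklearn.metrics.precision_recall_curve` (toy scores; our `precision_recall_curves` helper works fold-wise and may differ in detail):

```python
# Toy scores (hypothetical): highest recall reachable at >= 90% precision
import numpy as np
from sklearn.metrics import precision_recall_curve

def max_recall_at_precision(y_true, scores, min_precision):
    # scan all PR points and keep the best recall meeting the precision target
    precision, recall, _ = precision_recall_curve(y_true, scores)
    ok = precision >= min_precision
    return recall[ok].max() if ok.any() else 0.0

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])
r = max_recall_at_precision(y_true, scores, 0.9)
```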
n_est_range = [10,20]
pipeline = lambda param: make_pipeline(
PCA_preproc,
RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1))
precision_recall_curves(n_est_range,pipeline,x,y,'PCA RF',[0.8,0.9],dates)
On these subplots the first line gives the average recall for different precision levels. E.g. for a random forest with 20 trees:
The precision levels were chosen in accordance with the Business Intelligence team, which determined that these would be the most helpful for later business decisions.
We can do the same kind of exploration as with the old technique, but this time we will have a trustworthy estimate of the performance of the classifier.
precision_levels = [0.7,0.8,0.9]
results = []
create_result_dict = lambda name,seventy,eighty,ninety,auc,p: {'Method':name,
"Recall for 70% Acc":seventy,
"Recall for 80% Acc": eighty,
"Recall for 90% Acc":ninety,
"AUC":auc,
"Parameter":p}
As will be explained in the Understanding Classification section, we realized that dimensionality reduction might actually hurt classifier performance.
simple_preproc = make_pipeline(
sklearn.preprocessing.Imputer(strategy='mean'),
sklearn.preprocessing.StandardScaler())
params = [50,100,1000,5000]
pipeline = lambda param: make_pipeline(
simple_preproc,
RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Random Forests',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = [100,1000,5000]
pipeline = lambda param: make_pipeline(
PCA_preproc,
RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Random Forests with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = [100,1000,5000]
pipeline = lambda param: make_pipeline(
LDA_preproc,
RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Random Forests with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = np.logspace(-3,0,10)
pipeline = lambda param: make_pipeline(simple_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1',dual=False,C=param,random_state=42, max_iter=1000)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Linear SVM',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(PCA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1',dual=False,C=param,random_state=42, max_iter=1000)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Linear SVM with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(LDA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1',dual=False,C=param,random_state=42, max_iter=1000)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Linear SVM with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = np.logspace(-1,0,4)
pipeline = lambda param: make_pipeline(
simple_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Non Linear SVM',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(
PCA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Non Linear SVM with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(
LDA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Non Linear SVM with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = [50,100,1000,5000]
pipeline = lambda param: make_pipeline(simple_preproc,sklearn.ensemble.GradientBoostingClassifier(n_estimators=param))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Gradient Boosting',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = [500,1000,5000]
pipeline = lambda param: make_pipeline(PCA_preproc,sklearn.ensemble.GradientBoostingClassifier(n_estimators=param))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Gradient Boosting with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = [500,1000,5000]
pipeline = lambda param: make_pipeline(LDA_preproc,sklearn.ensemble.GradientBoostingClassifier(n_estimators=param))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Gradient Boosting with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = np.linspace(1,100,10)
pipeline = lambda param: make_pipeline(simple_preproc,sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'kNN',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(PCA_preproc,sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'kNN with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(LDA_preproc,sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'kNN with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = np.logspace(-3,2,20)
pipeline = lambda param: make_pipeline(
simple_preproc,
sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Logistic Regression',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(
PCA_preproc,
sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Logistic Regression with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(
LDA_preproc,
sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Logistic Regression with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
params = np.logspace(-3,2,10)
pipeline = lambda param: make_pipeline(
simple_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Ridge Regression',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(
PCA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Ridge Regression with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
pipeline = lambda param: make_pipeline(
LDA_preproc,
sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Ridge Regression with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
results_df = pd.DataFrame(results)
results_df.sort_values(by='Recall for 90% Acc',ascending=False)
On the previous plots it is important to note that the darker curve isn't the mean of the per-fold curves; this is due to the difficulty of interpolating in PR space. Another technique, called score concatenation, accumulates the y_pred and y_real of every fold and then plots the curve by treating the accumulated values as a single dataset of predicted and real targets. This allows us to display some characteristics that we wish to use to compare different models:
Before going further, we decided in accordance with the Business Intelligence team to perform some checks in order to understand which features carry the most importance for the best classifier. That allowed us to perform a reality check with the HFC support team.
Because many classifiers reach a similar level of performance (we mostly look at the 'Recall for 90% Acc' column), we decided to use the top one, Gradient Boosting, also because it has a built-in function to get feature importances.
Because we were interested in the importance of the original features and not the PCA components, we tried a couple of things, including looking at the feature importance without applying an initial PCA. That showed us that the model performed much better without PCA:
We realized our mistake of not first trying the classification without any dimensionality reduction, and we decided to add this preprocessing option to our model exploration (for simplicity it is already included in the model exploration above, and we simply reuse those results).
# we import the data again in order to obtain the feature names, which are lost in x
s_day = '22_05'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'
extracted = import_sample(DATA_PATH,BACKUP_PATH)
dates = extracted['day_0'].values
x_extracted, y_extracted = get_ml_data(extracted)
x_encoded = encode_categorical(x_extracted)
x_df = remove_features(x_encoded,verbose = True)
clf = make_pipeline(simple_preproc,
sklearn.ensemble.GradientBoostingClassifier(n_estimators=50))
clf.fit(x,y)
features = list(x_df.columns)
feature_analysis = pd.DataFrame()
feature_analysis['feature'] = features
feature_analysis['importance'] = clf.steps[1][1].feature_importances_
feature_analysis = feature_analysis.sort_values(by='importance',ascending=False)
feature_analysis.head(4)
n_useless = len(feature_analysis[feature_analysis['importance'] == 0])
print('There are {} features that are not used by the classifier'.format(n_useless))
n_top = 4
to_analyse = list(feature_analysis['feature'].head(n_top).values)
df = pd.DataFrame(x_df.filter(items=to_analyse))
df['sick'] = y
df.head(4)
compare_distrib(to_analyse[0],df)
compare_distrib(to_analyse[1],df)
compare_distrib(to_analyse[2],df)
compare_distrib(to_analyse[3],df)
For these four variables we binned the range of values and plotted side by side the distribution among the sick and healthy populations. One of the most interesting points from this analysis is that the most important feature is offline_pct: sick CPEs have a much greater chance than healthy CPEs of being offline for at least one hour during the last 24 hours.
This is also interesting because the features include the offline percentage over the last 24 hours in 6-hour windows. Hence this could indicate that the delay between the problem occurring and the customer calling is not the same for all users.
It also shows that a CPE going offline is a strong indicator of sickness, so we may consider that a CPE experiencing problems tends to go offline (which could influence the way we use predictions to act proactively on CPEs).
We have seen during the model exploration phase that Gradient Boosting shows promising results compared to other models. Therefore we will try to tune this model as much as possible. In order to perform such tuning we need to define a scoring function.
While the ROC-AUC (Area Under the ROC Curve) is fairly often used to measure the performance of classifiers, our problem is closer to anomaly detection, where the ROC curve discriminates poorly because the number of TN is much higher than the number of TP. We will therefore not rely on this metric to score our classifiers.
Moreover, as explained earlier, we really focus on a particular range of precision, since we have no business use for low precision and high recall. We therefore decided to use the partial area under the Precision-Recall curve, so that we get a single number summarizing how the classifier performs on the leftmost part of the curve.
Rk: while we have so far used the Recall at 90% precision to discriminate between models, this partial area gives us a finer way to fine-tune a given model.
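As an illustration, this partial area can be computed from sklearn's `precision_recall_curve`. The sketch below is a minimal version (the 0.9 precision floor is illustrative; our actual scorer lives in our own scripts):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def partial_pr_auc(y_true, y_score, min_precision=0.9):
    """Area under the PR curve restricted to the points whose precision
    is at least `min_precision` (the leftmost, high-precision part)."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    keep = precision >= min_precision
    if keep.sum() < 2:
        # fewer than two qualifying points: no area to integrate
        return 0.0
    # trapezoidal integration of precision over recall on the kept segment
    return abs(np.trapz(precision[keep], recall[keep]))
```

A perfect classifier keeps precision at 1 over the full recall range, so its partial area is 1 regardless of the floor.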
The general approach is to fix the learning rate and the number of estimators first, and then tune the tree-based parameters. There are many parameters that can be modified, so we need a strategy to avoid an exponentially growing number of possible combinations.
learning_rates = np.linspace(0.05,0.3,5)
n_estimators = [int(x) for x in np.linspace(20,100,10)]
params = {'learning_rate': learning_rates, 'n_estimators':n_estimators}
estimator = lambda kw_args: make_pipeline(simple_preproc,
sklearn.ensemble.GradientBoostingClassifier(**kw_args))
res = custom_GridSearchCV(x,y,dates,estimator,params,cv=5)
Following the advice given here, we will use the previously determined optimal boosting parameters and grid search the tree-specific parameters (maximum depth and minimum samples per split):
optimal_args = res['best']['best_args']
opt_learning_rate = optimal_args['learning_rate']
opt_n_est = optimal_args['n_estimators']
params = {'learning_rate': [opt_learning_rate], 'n_estimators': [opt_n_est],
'max_depth':range(5,16,2), 'min_samples_split':range(200,1001,200)}
res = custom_GridSearchCV(x,y,dates,estimator,params,cv=5)
optimal_args = res['best']['best_args']
print('Our Grid Search revealed that the following are the optimal parameters')
print('\tBoosting Parameters:\n\t\tNumber of estimators: {}\n\t\tLearning Rate: {}'.format(optimal_args['n_estimators'],optimal_args['learning_rate']))
print('\tTree Based Parameters:\n\t\tMaximum depth: {}\n\t\tMinimum samples splits: {}'.format(optimal_args['max_depth'],optimal_args['min_samples_split']))
# save the model to disk
import pickle, os
filename = DATA_FOLDER + '/final_model_untrained.sav'
if not os.path.isfile(filename):
    pickle.dump(estimator(res['best']['best_args']), open(filename, 'wb'))
    clf = estimator(res['best']['best_args'])
else:
    # load model
    clf = pickle.load(open(filename, 'rb'))
single_precision_recall_curve(estimator(res['best']['best_args']), x, y, dates,'')
single_roc_curve(x,y,estimator,res['best']['best_args'],top_ratio = 0.2,distinct_date=True,dates=dates)
In this section we gather the different analyses that were performed on the model. They did not necessarily take place once we had the final model, but we judged that they were not adding much information for the reader in the process described above (they did not influence the choices we made).
So far we had not had the opportunity to check the output of our model, because of the limited resources the HFC support team could dedicate to this project. We therefore waited until the day of our meeting with the team to perform a prediction, so that they could tell us whether they saw a problem with the predicted CPEs or not.
To perform such a check we had to use a sample from the database computed on the same day as the meeting (describing the health measurements relative to day_0 being the previous day), and we also had to modify the way we build our datasets.
Previously we deleted the days without any sick CPEs, but the last 8 days of our sample never contain any sick CPEs, simply because the FTR (first time resolution) flags are not yet known for them.
s_day = '11_06'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'
full_extracted = import_sample(DATA_PATH,BACKUP_PATH,delete_only_healthy_days=False)
full_dates = full_extracted['day_0'].values
# we get the most recent day to build the test set
test_extracted = full_extracted[full_extracted['day_0'] == np.max(np.unique(full_dates))]
x_extracted_te, y_extracted_te = get_ml_data(test_extracted)
x_encoded_te = encode_categorical(x_extracted_te)
x_te_df = remove_features(x_encoded_te,verbose = True)
x_te = x_te_df.values
We do need labels for the training set, and therefore use our usual technique to get the data ready for machine learning:
x_tr,y_tr,dates = usable_data('11_06',DATA_FOLDER)
The Precision-Recall curve also shows us the probability cutoff above which we should consider a CPE sick, depending on the levels of (Precision, Recall) we wish to obtain (on average).
single_precision_recall_curve(estimator(res['best']['best_args']), x_tr, y_tr, dates,'')
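Reading the cutoff for a target precision off the curve can be sketched as follows (a minimal illustration; the 0.9 target and the helper name are hypothetical):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def cutoff_for_precision(y_true, y_score, target_precision=0.9):
    """Smallest probability threshold whose (average) precision reaches
    the target; returns None if no threshold does."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision has one more entry than thresholds; drop the final (1, 0) point
    ok = np.where(precision[:-1] >= target_precision)[0]
    return thresholds[ok[0]] if len(ok) else None
```

The returned threshold can then be used exactly like the `cutoff` constant below.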
# fit the model
clf = estimator(res['best']['best_args'])
clf.fit(x_tr,y_tr)
# predict
cutoff = 0.7556
sick_probas = clf.predict_proba(x_te)[:,1]
test_extracted['sick_proba'] = sick_probas
test_extracted[test_extracted['sick_proba'] >= cutoff]
Because no CPE satisfied the most discriminative cutoff, we went for the second most discriminative one in order to check some suspected CPEs with the HFC support team.
Because we only had 1 CPE at hand, we decided to test on more dates, as the team also has access to a history of up to 10 days in the tools that help them give us a diagnostic.
# we get all the dates that won't be in the training set
test_extracted = full_extracted[full_extracted['day_0'].isin(np.setdiff1d(full_dates,dates))]
x_extracted_te, y_extracted_te = get_ml_data(test_extracted)
x_encoded_te = encode_categorical(x_extracted_te)
x_te_df = remove_features(x_encoded_te,verbose = True)
x_te = x_te_df.values
sick_probas = clf.predict_proba(x_te)[:,1]
test_extracted['sick_proba'] = sick_probas
test_extracted[test_extracted['sick_proba'] >= cutoff]
We decided to check 4 different devices to see if they indeed experienced some problem:
- 546751F40E59 ($p = 0.774174$): An HFC interface keeps resetting (a lot); during this period the customer has no connection. Probably a hardware modem issue. True Positive
- AC22059AC221 ($p = 0.892083$): Went offline for an extended amount of time. The only problem seems to be a lot of missing values; there are many timeouts and it might be a data collection issue. False Positive
- C427951777A7 ($p = 0.821313$): Lots of packets are dropped on one of the channels, which can result in performance issues for the customer (service degradations). It is most probably a cabling problem that he could solve himself if we sent him a replacement cable. True Positive
- 905C44BF6396 ($p = 0.783860$): Levels were down for an extended period of time and went back up, which should indicate a technician intervention. We couldn't find one, but given the similarities we can conclude that this is a True Positive

HFC support expressed a lot of satisfaction with these results: out of the 4 predictions we analyzed, only one was a false positive, and it was still displaying irregularities.
We got curious about the influence of the number of days between training and prediction on the performance of the classifier.
Because we do not have that many days of data we will limit this analysis, but the main idea is to see whether the model needs to be retrained every day, every 3 days, every week, etc.
dates_to_consider = get_longest_date_seq(dates)
With 41 days to work with, we now need a strategy to use these dates. We would like to see the influence of the gap period $\in\{0,\dots,7\}$ (the difference between the last day of training and the first day of testing).
We want to keep the training set size constant so that it doesn't influence the performance of the model.
We would also like to repeat the operation multiple times in order to have a representative metric in the end. We will use 25% of the dataset to test for the largest gap length; here we want at most 7 days of gap:
Therefore when gap_length = 7, what we do is that the first training round will take:
and when gap_length = 1, the first round is:
Then, in order to reduce the number of time-consuming trainings, we note that for each gap_length we can accumulate the true_y and score_y computed at each round, which gives us a good estimate of the performance in the end.
plot_gap_performance(x_tr,y_tr,dates,clf,list(range(0,8)))
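The windowing scheme described above can be sketched as follows. This is a minimal illustration with a hypothetical `gap_splits` helper (the real logic lives inside `plot_gap_performance`): a fixed-size training window, then `gap_length` skipped days, then the test window, slid over the sequence of consecutive days.

```python
def gap_splits(days, n_train, gap_length, n_test=1):
    """Yield (train_days, test_days) windows over a sorted sequence of
    consecutive days. The training window size is held constant so that
    only the gap between training and testing influences performance."""
    days = list(days)
    round_len = n_train + gap_length + n_test
    for start in range(len(days) - round_len + 1):
        train = days[start:start + n_train]
        test = days[start + n_train + gap_length:start + round_len]
        yield train, test
```

Each yielded pair gives one training/prediction round; pooling the predictions of all rounds for a given gap_length yields the aggregate score.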
plot_gap_performance(x_tr,y_tr,dates,clf,[0,3,7,12])
We can see no significant difference between the performance of the models whatever the time gap between training and testing: apart from the variance due to the dataset, we cannot observe a trend where performance decreases as the gap between training and predicting grows. This is reassuring, as it means we have already taken out a good part of the seasonality of the data. It also ensures that it is acceptable that, in real life, the training set is 8 days older than the testing set (because of the FTR flag).
We were fairly curious about why our initial cross-validation strategy did not estimate the performance of our classifier correctly. Our main guess was that the vectors still contained some sort of seasonality that we couldn't get rid of during the aggregation of the data.
Because we didn't have time to investigate the problem deeply, we simply tried one thing: checking whether training the model on the same weekday as the day of testing would yield an overfitted model with strangely high performance.
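Selecting such a same-weekday training set can be sketched as follows (a minimal illustration with a hypothetical `same_weekday_mask` helper; the actual check is performed by `weekday_influence` below):

```python
import pandas as pd

def same_weekday_mask(dates, test_day):
    """Boolean mask selecting the samples whose day_0 falls on the same
    weekday as the test day (excluding the test day itself)."""
    dates = pd.to_datetime(pd.Series(dates))
    test_day = pd.Timestamp(test_day)
    mask = (dates.dt.dayofweek == test_day.dayofweek) & (dates != test_day)
    return mask.values
```

The mask can then be used to index x and y so that the model only ever trains on the test day's weekday.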
date_string = '11_06'
data_location = DATA_FOLDER
data_path = data_location + '/sample_'+date_string+'.xlsx'
backup_path = data_location+'/sample_'+date_string+'.pk'
extracted = import_sample(data_path,backup_path)
dates = extracted['day_0'].values
x_extracted, y_extracted = get_ml_data(extracted)
y = convert_to_binary_labels(y_extracted)
x_encoded = encode_categorical(x_extracted)
x_df = remove_features(x_encoded,verbose = True)
weekday_influence(x_df,y,dates,clf)
What we observe is that the model is much less stable on weekends, but during the rest of the week the performance of the classifier doesn't increase drastically; we therefore have no clear sign of overfitting on the weekdays.